About the project

I am feeling excited. I expect to learn basically R in this course.  I received an email about this course from university of Helsinki. Here is my forked repository:

GitHub repository : https://github.com/chamalee/IODS-project


Regression and model validation

This week I have covered data wrangling and linear regression analysis.

date()
## [1] "Mon Dec  6 02:38:06 2021"
learning2014 <- read.csv(file = "data/learning2014.csv", header = TRUE)
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166   8

1. Information of dataset

This dataset contains data of 183 participants from a course which was collected in 2014. This dataset has 7 columns as follows.

  1. Gender (Gender: M (Male), F (Female))
  2. Age (in years) derived from the date of birth
  3. Attitude (Global attitude towards statistics)
  4. Deep (ASSIST scale)
  5. Strategic (ASSIST scale)
  6. Surface (ASSIST scale)
  7. Points (Exam points)

Background of measures:

Measures A and C are based on parts A and C in ASSIST (Approaches and Study Skills Inventory for Students) http://www.etl.tla.ed.ac.uk/publications.html#measurement and D is based on SATS (Survey of Attitudes Toward Statistics) http://www.evaluationandstatistics.com/

The items of the central measure in this study (ASSIST B) are named so that the connections to the corresponding dimensions (Deep/SUrface/STrategic) can be easily seen.

2. Graphical overview of the data and summaries of the variables

# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p

Interpret the results

From above graphical overview, we can see the relationships between the variables. It shows the correlations between different variables and distributions of each variables with respect to ‘gender’ variable.

The colour of plots is defined with the ‘gender’ variable. Pink color plots correspond to Female and green color for Male.

In this graphical overview, we can see the correlations between all variables. We have a negative correlation between points and age. The trend is that when age increases points decreases. And there is a positive correlation between points are attitudes. The trend is that when attitude increases point increases. If we consider correlation between surf variable and deep variable, there is a strong correlation in males where as a weak correlation in females. However, we can also observe that there are approximately twice number of more female than male.

3. Fit a regression model

# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# fit a linear model # create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + deep, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + deep, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## stra          0.9621     0.5367   1.793  0.07489 .  
## deep         -0.7492     0.7507  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

Interpret the results

Linear model: formula = points ~ attitude + stra + deep

The best model is found by minimizing the prediction errors (residuals) that model would make. Goal is to minimize the sum of squared residuals. The parameters of residuals are as follows.

Parameters of Residuals:

Min -17.5239, 1Q -3.4276, Median 0.5474, 3Q 3.8220, Max 11.5112.

Coefficients give the estimates of the model parameters corresponding to each variable along with the intercept. Intercept corresponds to the value in y-axis where linear model intersects y-axis.

Coefficients:

(Intercept) 11.3915
attitude 3.5254
stra 0.9621
deep -0.7492

P-values:

The P-value which corresponds to attitude is very small. Therefore, we can conclude there is a strong correlation between the points variable and attitude variable. Due to high p-value of deep variable, there is the least correlation between points variable and deep variable. P-values are as follows.

attitude 4.44e-09 *** stra 0.07489 .
deep 0.31974

We can even observe standard error values of parameter estimates from Std. Error column in the summary - table of coefficients.

Remove less correlated variable

deep variable does not have a statistically significant relationship with the target variable; points.

# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# fit a linear model # create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra , data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

4. Relationship between explanatory variables and the target variables

Here, it was considered, points as the target variable and attitude, stra and deep as the explanatory variables. It implies we are trying to estimate points variable as a multiple linear model of attitude, stra and deep variables. We can see that there is a statistically significant relationship between exam points and attitude, but not with deep. This means that deep does not interpret anything about exam points, but the attitude do interpret. Therefore, deep variable was removed from the consideration and remodelled only with attitude and stra variables.

Multiple R squared of the model

With attitude + stra + deep variables -> Multiple R-squared: 0.2097 -> approximately 20% With attitude + stra variables -> Multiple R-squared: 0.2048 -> approximately 20%

Multiple R-squared implies the multiple correlation coefficient between three or more variables. It implies how strong the linear relationship is. In other words, how well the regression model fits the observed data. This value ranges from 0 to 1. For example, an R-squared of 20% reveals that 20% of the data fit the regression model. Generally, a higher R-squared indicates a better fit for the model. In this case, we can see after we remove deep variable, Multiple R-squared approximately remains unchanged. That is because deep variable does not have a high impact towards point variable.

5. Diagnostic plots

# a scatter plot of points versus attitude
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = learning2014)

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which = c(1,2,5))

The assumptions of the model

Model assumptions in linear regression are as follows.

  1. In a linear regression model an obvious assumption is linearity: the target variable is modelled as a linear combination of the model parameters.
  2. The errors are normally distributed.

Analyzing the residuals of the model provides a method to explore the validity of model assumptions.

Assumptions related to errors are as follows.

  1. The errors are normally distributed.
  2. The errors are not correlated.
  3. The errors have a constant variance.
  4. The size of a given error does not depend on the explanatory variables. (constant variance of errors)

Interpret the validity of those assumptions based on the diagnostic plots

Residuals vs Fitted-plot provides a method to explore the assumption that the size of a given error does not depend on the explanatory variables. Any pattern will imply a problem in the assumption. As we can see in above Residuals vs Fitted-plot, there is a reasonable random spread of points. We can not observe a pattern; but a constant variance. It confirms the assumption that the size of the errors does not depend on the explanatory variables.

Normal QQ-plot provides a method to explore the assumption that errors of the model are normally distributed. As we can see in above QQ-plot, there is a reasonable fit in the middle. But in the corners, there is a clear deviation which makes the normality assumption being questionable.

The Residuals vs leverage shows how much impact single point has on the model. It helps to identify which points have an unusually high impact. As we can see in above Residuals vs leverage plot, there are no highly unusual outliers.


Logistic regression

This week I have covered data wrangling and logistic regression analysis.

date()
## [1] "Mon Dec  6 02:38:17 2021"

2. Information of dataset

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# read the data into memory
alc <- read.table("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", sep=",", header=TRUE)

dim(alc)
## [1] 370  51
glimpse(alc)
## Rows: 370
## Columns: 51
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ age        <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE…
## $ Pstatus    <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",…
## $ Medu       <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,…
## $ Fedu       <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,…
## $ Mjob       <chr> "at_home", "other", "at_home", "services", "services", "ser…
## $ Fjob       <chr> "other", "other", "other", "health", "services", "health", …
## $ reason     <chr> "home", "reputation", "reputation", "course", "reputation",…
## $ guardian   <chr> "mother", "mother", "mother", "mother", "other", "mother", …
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,…
## $ studytime  <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,…
## $ schoolsup  <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",…
## $ famsup     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", …
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes…
## $ romantic   <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n…
## $ famrel     <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,…
## $ freetime   <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,…
## $ goout      <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,…
## $ Dalc       <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc       <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,…
## $ health     <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,…
## $ n          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ id.p       <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,…
## $ id.m       <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,…
## $ failures   <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,…
## $ paid       <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences   <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1…
## $ G1         <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,…
## $ G2         <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14…
## $ G3         <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,…
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ paid.p     <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2…
## $ G1.p       <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,…
## $ G2.p       <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,…
## $ G3.p       <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,…
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,…
## $ paid.m     <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",…
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, …
## $ G1.m       <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1…
## $ G2.m       <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ G3.m       <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ alc_use    <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,…
## $ high_use   <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,…
## $ cid        <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,…
# summary(alc)

This dataset contains data of 370 participants in secondary education of two Portuguese schools. This dataset has 51 columns. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. This combined datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008].

3. My personal hypothesis about relationships of chosen variables with alcohol consumption. ( alc_use,high_use)

The purpose of the analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. 4 interesting variables : - sex : I think female student might consume less amount of alcohol with compared to male students. - age : I think higher the age, the highest consumption. - romantic : I expect that having a committed relationship will make the least alcohol consumption. - freetime : I expect that more free time will allow the students for high alcohol consumption. In my point of view, high_use variable will behave approximately similar to alc_use.

4. Graphical representation of the distributions of aforementioned chosen variables

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(dplyr)
alc_questions <- c("alc_use", "high_use", "sex", "age", "romantic", "freetime")

# alc_cols %>%
#   select(alc,one_of(alc_questions))

alc_cols <- alc[,alc_questions]


# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(alc_cols, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p

p <- ggpairs(alc_cols, mapping = aes(col = high_use, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p

We can Immediately notice that,

  • alc_use has a strong correlation to age and freetime .
  • For alc_use, age mostly affect on female students and freetime mostly affect male students.
  • age has a positive correlation with freetime only on male students.
  • age has a negative correlation with freetime only on female students.
  • We do not see a strong correlation with other variable freetime eventhough I assumed before.
# 
# # access the 'tidyverse' packages dplyr and ggplot2
# library(dplyr); library(ggplot2)
# 
# # define a new column alc_use by combining weekday and weekend alcohol use
# alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
# 
# # initialize a plot of alcohol use
# g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
# 
# # define the plot as a bar plot and draw it
# g1 + geom_bar()
# 
# # define a new logical column 'high_use'
# alc <- mutate(alc, high_use = alc_use > 2)
# 
# # initialize a plot of 'high_use'
# g2 <- ggplot(alc, aes(high_use))
# 
# # draw a bar plot of high_use by sex
# g2 + facet_wrap("sex") + geom_bar()

5. Logistic regression

Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.

model <- glm(high_use ~ age + freetime, 
              data = alc, 
              family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ age + freetime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3711  -0.8959  -0.7071   1.3148   1.9878  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -5.30605    1.68463  -3.150  0.00163 **
## age          0.19413    0.09738   1.994  0.04620 * 
## freetime     0.37360    0.12149   3.075  0.00210 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 438.07  on 367  degrees of freedom
## AIC: 444.07
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model)
## (Intercept)         age    freetime 
##  -5.3060538   0.1941321   0.3735975

What we can conclude from these coefficients is that:

  • p-value of freetime indicates that it is somewhat significant to high_use.
  • Comparatively high p-value of age w.r.t freetime indicates that it is somewhat less significant to high_use.

Next, we will add romantic to the model.

model2 <- glm(high_use ~ age + freetime + romantic, 
              data = alc, 
              family = "binomial")
summary(model2)
## 
## Call:
## glm(formula = high_use ~ age + freetime + romantic, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4318  -0.8620  -0.7207   1.3094   1.9994  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -5.47316    1.69354  -3.232  0.00123 **
## age          0.20842    0.09841   2.118  0.03418 * 
## freetime     0.37709    0.12165   3.100  0.00194 **
## romanticyes -0.26045    0.25381  -1.026  0.30482   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 437.00  on 366  degrees of freedom
## AIC: 445
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model2)
## (Intercept)         age    freetime romanticyes 
##  -5.4731577   0.2084192   0.3770938  -0.2604451

Very high p-value of romantic indicates that it is the least significant to high_use. We can conclude here that my previous hypothesis is not fully accurate since romantic does not show a strong correlation with high_use.

Next, we will add sex to the model. It will create my previously stated hypothesis model.

model3 <- glm(high_use ~ age + freetime + romantic + sex, 
             data = alc, 
             family = "binomial")
summary(model3)
## 
## Call:
## glm(formula = high_use ~ age + freetime + romantic + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5445  -0.8506  -0.6576   1.2234   2.1190  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.74785    1.72535  -3.331 0.000864 ***
## age          0.21575    0.09994   2.159 0.030870 *  
## freetime     0.29151    0.12479   2.336 0.019491 *  
## romanticyes -0.20455    0.25873  -0.791 0.429182    
## sexM         0.80662    0.24192   3.334 0.000855 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 425.66  on 365  degrees of freedom
## AIC: 435.66
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model3)
## (Intercept)         age    freetime romanticyes        sexM 
##  -5.7478538   0.2157461   0.2915068  -0.2045533   0.8066158

Very low p-value of sex indicates that it is the most significant to high_use.

We can finalize the model as follows with the most significant variables.

model4 <- glm(high_use ~ age + freetime  + sex, 
             data = alc, 
             family = "binomial")
summary(model4)
## 
## Call:
## glm(formula = high_use ~ age + freetime + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4981  -0.8659  -0.6393   1.2428   2.0520  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.62126    1.71834  -3.271 0.001070 ** 
## age          0.20483    0.09903   2.068 0.038605 *  
## freetime     0.28665    0.12457   2.301 0.021383 *  
## sexM         0.81972    0.24134   3.396 0.000683 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 426.29  on 366  degrees of freedom
## AIC: 434.29
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model4)
## (Intercept)         age    freetime        sexM 
##  -5.6212569   0.2048304   0.2866508   0.8197204

let’s see the odds ratios with their CI’s.

OR <- model4 %>% 
  coef %>% 
  exp
CI <- model4 %>% 
  confint %>% 
  exp
## Waiting for profiling to be done...
m_OR <- cbind(OR, CI)
m_OR
##                      OR        2.5 %     97.5 %
## (Intercept) 0.003620088 0.0001161896 0.09987401
## age         1.227316904 1.0125010025 1.49443547
## freetime    1.331958997 1.0464192521 1.70714861
## sexM        2.269865003 1.4192715886 3.66181832

What we can conclude from these coefficients is that:

  • Higher age shows higher probability (1.22) of having higher high_use.
  • Similarly, higher freetime shows higher probability (1.33) of having higher high_use.

6.Cross tabulation of predictions versus the actual values.

2X2 cross tabulation can give us an idea of the predictive power of the model.

probs <- predict(model4, type = "response") # Predict probability responses based on the variables
a_predict <- alc %>%
  mutate(probability = probs) %>% # add them to our data.frame
  mutate(prediction = probability > 0.5) # Compare those probabilities to our significance cutoff to identify the logical value
# Plot the 2X2 cross tabulation of the predictions
table(high_use = a_predict$high_use, 
      prediction = a_predict$prediction) %>%
  prop.table %>%
  addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66486486 0.03513514 0.70000000
##    TRUE  0.25945946 0.04054054 0.30000000
##    Sum   0.92432432 0.07567568 1.00000000

Approximately 70% of the values were predicted correctly.

Let’s compare the performance with some simple guessing model.

model5 <- glm(high_use ~  failures, 
             data = alc, 
             family = "binomial")
probs <- predict(model5, type = "response") # Predict probability responses based on the variables
a_predict <- alc %>%
 mutate(probability = probs) %>% # add them to our data.frame
 mutate(prediction = probability > 0.5) # Compare those probabilities to our significance cutoff to identify the logical value
# Plot the 2X2 cross tabulation of the predictions
table(high_use = a_predict$high_use, 
     prediction = a_predict$prediction) %>%
 prop.table %>%
 addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67567568 0.02432432 0.70000000
##    TRUE  0.26756757 0.03243243 0.30000000
##    Sum   0.94324324 0.05675676 1.00000000

In new model too, approximately 70% of the values were predicted correctly.

7. Cross-validation

Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function (mean prediction error) is computed on data not used for finding the model. Low value = good.

Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.

# the logistic regression model m and dataset alc (with predictions) are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] NaN
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model4, K = 10)

Clustering and classification

This week I have covered clustering and classification and data wrangling for next week exercise.

date()
## [1] "Mon Dec  6 02:38:29 2021"

2. Information of Boston dataset from the MASS package

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

This dataset contains information of 506 houses in Boston suburbs. This dataset has 14 columns. The data attributes include features such as per capita crime rate by town, proportion of residential land zoned for lots over 25,000 sq.ft., social and Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)) and so on. You may see the all the features explained in brief.

crim :per capita crime rate by town. zn : proportion of residential land zoned for lots over 25,000 sq.ft. indus : proportion of non-retail business acres per town. chas :Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). nox : nitrogen oxides concentration (parts per 10 million). rm :average number of rooms per dwelling. age :proportion of owner-occupied units built prior to 1940. dis :weighted mean of distances to five Boston employment centres. rad :index of accessibility to radial highways. tax :full-value property-tax rate per $10,000. ptratio : pupil-teacher ratio by town. black :1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. lstat : lower status of the population (percent). medv : median value of owner-occupied homes in $1000s.

3.1 Graphical overview of the data and show summaries of the variables in the data

# MASS, corrplot, tidyr and Boston dataset are available
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# plot matrix of the variables
pairs(Boston)

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(dplyr)



# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(Boston, mapping = aes( alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),upper = list(continuous = wrap("cor", size = 2)))
# draw the plot
p

3.2 Interpretation of results

We can Immediately notice that,

  • dis has a strong negative correlation with indus. This is due to the fact that when proportion of non-retail business acres per town increases weighted mean of distances to five Boston employment centres decreases.
  • We do not see a strong correlation between chas variable and rad. That is because index of accessibility to radial highways does not depend on Charles River dummy variable. These two variables can be considered as almost independent.
  • nox has a strong positive correlation to indus. It is obvious that nitrogen oxides concentration increases when proportion of non-retail business acres per town increases.
  • rad has a strong correlation to indus, nox, crim and age. That is due to the fact that index of accessibility to radial highways has a positive relationship with proportion of non-retail business acres per town, nitrogen oxides concentration, per capita crime rate by town and proportion of owner-occupied units built prior to 1940.
  • rn has a positive correlation with zn. That is because average number of rooms per dwelling increases with the proportion of residential land zoned for lots over 25,000 sq.ft.

4.1 Standardization

  • In the Standardization, we subtract the column means from the corresponding columns and divide the difference with standard deviation.
  • The Boston data contains only numerical values, so we can use the function scale() to standardize the whole dataset.
# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

4.2 How did the variables change?

As we can observe above from the summary of the variables, we can notice that the distribution of values of each column variables have been scaled such that mean of the column data is zero and standard deviation is 1. Max and min values corresponding to a particular column too have adjusted accordingly.

4.3 Creating a categorical variable.

We can create a categorical variable from a continuous one. There are many ways to to do that. Let’s choose the variable crim (per capita crime rate by town) to be our factor variable. We want to cut the variable by quantiles to get the high, low and middle rates of crime into their own categories.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

4.4 Drop the old crime rate variable from the dataset.

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

4.5 Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

Divide and conquer: train and test sets When we want to use a statistical method to predict something, it is important to have data to test how well the predictions fit. Splitting the original data to test and train sets allows us to check how well our model works.

The training of the model is done with the train set and prediction on new data is done with the test set. This way you have true classes / labels for the test data, and you can calculate how well the model performed in prediction.

Time to split our data!

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

5.1 Linear discriminant analysis.

Linear Discriminant analysis is a classification (and dimension reduction) method. It finds the (linear) combination of the variables that separate the target variable classes. The target can be binary or multiclass variable. Linear discriminant analysis is closely related to many other methods, such as principal component analysis.

We are going to fit a linear discriminant analysis using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2400990 0.2623762 0.2500000 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.03361184 -0.8913504 -0.07547406 -0.8994663  0.4462966 -0.9015960
## med_low  -0.09537852 -0.2671521  0.01179157 -0.5583937 -0.1372268 -0.2741190
## med_high -0.37236182  0.1652254  0.21052285  0.3926004  0.1333123  0.4600409
## high     -0.48724019  1.0149946 -0.07742312  1.0048834 -0.2997713  0.8123132
##                 dis        rad        tax     ptratio       black        lstat
## low       0.9330350 -0.6901606 -0.7316320 -0.49771857  0.37645566 -0.762822619
## med_low   0.3736612 -0.5615559 -0.5208559 -0.07041323  0.31193464 -0.107924026
## med_high -0.3713032 -0.4390580 -0.3356933 -0.31107394  0.07263347  0.001671566
## high     -0.8462981  1.6596029  1.5294129  0.80577843 -0.78037825  0.881057981
##                 medv
## low       0.52290304
## med_low  -0.01847535
## med_high  0.18845404
## high     -0.65540803
## 
## Coefficients of linear discriminants:
##                  LD1           LD2         LD3
## zn       0.144377177  0.7107593424 -0.97880651
## indus    0.004127981 -0.0176122565  0.34927961
## chas    -0.089939542 -0.0617272507  0.06915370
## nox      0.261866853 -0.8120444047 -1.33226224
## rm      -0.111405213 -0.0661006148 -0.12562830
## age      0.217671351 -0.4174882404 -0.07455963
## dis     -0.163212638 -0.2344626664  0.25900495
## rad      3.675361452  1.0036025128  0.05358957
## tax      0.031496804 -0.1265091206  0.33795749
## ptratio  0.192185043 -0.0433414483 -0.27840869
## black   -0.137017089  0.0004039264  0.10499702
## lstat    0.184224019 -0.1996122200  0.33495885
## medv     0.191554575 -0.3422156723 -0.23611262
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9551 0.0350 0.0099

5.2 LDA with a biplot.

LDA can be visualized with a biplot.

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

It can be observed that rad has the most influence over the separation of the model. Moreover, it implies that rad variable is more likely to influence the most on the crim variable.

6. Linear discriminant analysis over test data.

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      12        4    0
##   med_low    7      14        8    0
##   med_high   0       5       13    2
##   high       0       0        1   25

We can observe that diagonals of the cross tabulated table (14,18,11,22) represent correctly classified crime categories where as off-diagonals represent the mis-classifications. Correct classifications will be interpreting low as low, med low as med low, med high as med high and high as high. All others are mis-classifications. Correct classifications = 14+18+11+22 = 65 Total mis-classifications = 5+16+1+2+3+10= 37

7.1 Clustering.

Reload the Boston dataset and standardize the dataset. Calculate the distances between the observations.

# load MASS and Boston
library(MASS)
data('Boston')

# center and standardize variables
boston_scaled <- scale(Boston)


# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)


# euclidean distance matrix
dist_eu <- dist(Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047
# manhattan distance matrix
dist_man <- dist(Boston, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.016  149.145  279.505  342.899  509.707 1198.265

7.2 K-means clustering

K-means is maybe the most used and known clustering method. It is an unsupervised method, that assigns observations to groups or clusters based on similarity of the objects. In the previous exercise we got a hang of distances. The kmeans() function counts the distance matrix automatically, but it is good to know the basics. Let’s cluster a bit!

# k-means clustering
km <-kmeans(Boston, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

7.3 Investigate the optimal number of clusters

K-means needs the number of clusters as an argument. There are many ways to look at the optimal number of clusters and a good way might depend on the data you have.

One way to determine the number of clusters is to look at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. When you plot the number of clusters and the total WCSS, the optimal number of clusters is when the total WCSS drops radically.

K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function set.seed() can be used to deal with that.

Steps are as follows.

  • Set the max number of clusters (k_max) to be 10
  • Execute the code to calculate total WCSS. This might take a while.
  • Visualize the total WCSS when the number of cluster goes from 1 to 10.
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

7.4 Interpretation of results of total WCSS

The optimal number of clusters is when the value of total WCSS changes radically. In this case, we observe that radical change from 1 to 2. Hence, two clusters would seem optimal.

Next, Run kmeans() again with two clusters and visualize the results.

# k-means clustering
km <-kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

Bonus

Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?

data("Boston")

boston_scaled <- Boston %>%
  scale %>%
  as.data.frame

set.seed(123) 
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# clusters
boston_b <- boston_scaled %>% 
  mutate(crim = as.factor(km$cluster))

# number of rows in the Boston dataset 
n <- nrow(boston_b)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_b

# create test set
test <- boston_b

correct_classes <- test$crim

test <- dplyr::select(test, -crim)

#  perform LDA using the clusters as target classes
lda.fit <- lda(crim ~ ., data = train)

# target classes as numeric
classes <- as.numeric(train$crim)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

  • It can be observed that rad has the most influence for the separation of the model.

  • Moreover, it implies that rad variable is influencing the most into cluster 1.

  • Next, age variable is the second most influential variable for the separation of the model.

  • age variable is highly likely to separate cluster 2 and 3 with respect to LD 2 axis.


Dimensionality reduction techniques

This week I have covered Dimensionality reduction techniques and data wrangling exercise.

date()
## [1] "Mon Dec  6 02:39:00 2021"

0. Information of prepared Human dataset

# read the human data from URL
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep  =",", header = T)
# human <- read.table("data/human.csv", sep = ";")

dim(human)
## [1] 155   8
# look at the (column) names of human
names(human)
## [1] "Edu2.FM"   "Labo.FM"   "Edu.Exp"   "Life.Exp"  "GNI"       "Mat.Mor"  
## [7] "Ado.Birth" "Parli.F"

The origin of the data is the United Nations Development Programme. This dataset contains information of 155 countries. This dataset has 8 columns. The data attributes include features such as Gross National Income, Life expectancy at birth, Expected years of schooling and so on.

You may see all features explained in brief.

  • Edu2.FM : Proportion of females with at least secondary education / Proportion of males with at least secondary education
  • Labo.FM : Proportion of females in the labour force / Proportion of males in the labour force
  • Edu.Exp : Expected years of schooling
  • Life.Exp : Life expectancy at birth
  • GNI : Gross National Income
  • Mat.Mor : Maternal mortality ratio
  • Ado.Birth : Adolescent birth rate
  • Parli.F : Percetange of female representatives in parliament
# look at the structure of human
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# print out summaries of the variables
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

1.1 Graphical overview of the data

# Access GGally
library(GGally)
library(corrplot)

# visualize the 'human' variables
ggpairs(human)

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot

corrplot.mixed(cor(human, use = "pairwise"), order ="hclust")

1.2 Interpretation of results

We can Immediately notice that,

  • Mat.Mor has a strong negative correlation with Life.Exp. This is due to the fact that when Maternal mortality ratio increases Life expectancy at birth decreases.
  • We do not see a strong correlation between GNI variable and Labo.FM. That is because Gross National Income does not depend on Proportion of females in the labour force / Proportion of males in the labour force variable. These two variables can be considered as almost independent.
  • Life.Exp has a strong positive correlation to Edu.Exp. It is obvious that Expected years of schooling increases when Life expectancy at birth increases.
  • Life.Exp has a strong correlation to both Mat.Mor and Ado.Birth. That is due to the fact that Life expectancy at births has a negative relationship with both Maternal mortality ratio and Adolescent birth rate.
  • Ado.Birth has a positive correlation with Mat.Mor. That is because Maternal mortality ratio increases with Adolescent birth rate.
  • We do not see a strong correlation between Edu.Exp variable and Labo.FM. That is because Expected years of schooling does not depend on Proportion of females in the labour force / Proportion of males in the labour force variable. These two variables can be considered as almost independent.

2.1 Principal component analysis (PCA)

Principal Component Analysis (PCA) can be performed by two sightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).

There are two functions in the default package distribution of R that can be used to perform PCA: princomp() and prcomp(). The prcomp() function uses the SVD and is the preferred, more numerically accurate method.

Both methods quite literally decompose a data matrix into a product of smaller matrices, which let’s us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# print out summaries of the non-standardized variables
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

2.2 PCA with a biplot.

A biplot is a way of visualizing the connections between two representations of the same data. First, a simple scatter plot is drawn where the observations are represented by two principal components (PC’s). Then, arrows are drawn to visualize the connections between the original variables and the PC’s.

The following connections hold:

The angle between the arrows can be interpret as the correlation between the variables. The angle between a variable and a PC axis can be interpret as the correlation between the two. The length of the arrows are proportional to the standard deviations of the variables.

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

2.3 Interpretation of results

  • In bi-plot, the length of the arrow is proportional to standard deviation of features.
  • Small angle implies high positive correlation.
  • It can be observed that GNI has the largest length arrow indicating highest standard deviation.
  • Angle between first principal component and GNI variable is almost zero which indicates a high positive correlation.
  • Unfortunately, these kinds of results is due to non-standardized data.

3.1 Standardization

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

3.2 PCA with a biplot.

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

3.3 Interpretation of results

  • We can Immediately notice that results with and without standardizing are different.

  • After standardization, each column variables have been scaled such that mean of the column data is zero and standard deviation is 1.

  • In bi-plot, the length of the arrow is proportional to standard deviation of features/variables.

  • Since data is standardized, we can observe almost similar lengths in all arrows. It concludes that all variables have equal standard deviations.

  • In bi-plot, Small angle implies high positive correlation.

  • We can see that Labo.FM variable is almost orthogonal to Edu.Exp variable. We do not see a strong correlation between Edu.Exp variable and Labo.FM. That is because Expected years of schooling does not depend on Proportion of females in the labour force / Proportion of males in the labour force variable. These two variables can be considered as almost independent.

  • It can be observed that there is a small angle between Mat.Mor and Ado.Birth. They are pointing to the same direction. That is due to the fact that Ado.Birth has a strong positive correlation with Mat.Mor. Maternal mortality ratio increases with Adolescent birth rate.

  • On the other hand, Mat.Mor and Ado.Birth almost align with first principle component. It implies that they contribute to that dimension.

5.1 Load the tea dataset from the package Factominer and explore the dataset

The Factominer package contains functions dedicated to multivariate explanatory data analysis. It contains for example methods (Multiple) Correspondence analysis , Multiple Factor analysis as well as PCA.

In the next exercises we are going to use the tea dataset. The dataset contains the answers of a questionnaire on tea consumption.

library(FactoMineR)

# load the data set from FactoMineR-package
data("tea") 

# the structure and the dimensions of the data and visualize
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
library(tidyr)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

  • In the summary, we can see eigenvalues corresponding to total of 11 dimensions.

5.2 Multiple Correspondence Analysis

Multiple Correspondence Analysis (MCA) is a method to analyze qualitative data and it is an extension of Correspondence analysis (CA). MCA can be used to detect patterns or structure in the data as well as in dimension reduction.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

5.3 Interpretation of results

  • In the above MCA factor map, the variables are drawn only for the first two dimensions.

  • The plot above helps to identify categories that are the most correlated with each dimension. The squared correlations between categories and the dimensions are used as coordinates.

  • It can be seen that, the category variables alone and lunch are the most correlated with dimension 1. Similarly, the variables sugar and No.Sugar are the most correlated with dimension 2.

  • Moreover, it is more likely to have unpackaged tea in a tea shop while tea bag tea in chain store. These two pairs can be observed together in the plot.

5.4 MCA with a screeplot.

# draw a screeplot of the MCA 

library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))

  • We can use screeplot to visualize the percentages of inertia explained by each MCA dimensions.
  • It can be seen that highest inertia is with dim 1 and dim-10 has the lowest inertia.

5.5 MCA with a biplot of both individuals and categories.

# draw a biplot of the MCA and the original variables

mca <- MCA(tea_time, graph = FALSE)

library("factoextra")
# eig.val <- get_eigenvalue(mca)
# fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))
fviz_mca_biplot(mca, 
               repel = TRUE, # Avoid text overlapping (slow if many point)
               ggtheme = theme_minimal())
## Warning: ggrepel: 190 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

  • The plot above helps to identify variables that are the most correlated with each dimension. The squared correlations between variables and the dimensions are used as coordinates.

  • Active individuals are in blue in the plot.

  • Active variable categories are in red in the plot.

  • Individuals such as 171,277 and 278 are strongly correlated with dim 1 where as 72,270 and 15 are strongly correlated with dim 2.

Reference:

Datacamp : https://campus.datacamp.com/courses/helsinki-open-data-science/dimensionality-reduction-techniques

(more chapters to be added similarly as we proceed with the course!)